
Conversation

@shiweijiezero
Contributor

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before the code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

shiweijiezero and others added 29 commits October 24, 2025 17:53
# Conflicts:
#	benchmark/config/countdown-template.yaml
#	docs/sphinx_doc/source/tutorial/example_step_wise.md
#	docs/sphinx_doc/source_zh/tutorial/example_step_wise.md
#	examples/agentscope_react/gsm8k.yaml
#	examples/agentscope_tool_react/agentscopev0_tool_react_dapo.yaml
#	examples/agentscope_tool_react/agentscopev0_tool_react_gsm8k.yaml
#	examples/agentscope_tool_react/agentscopev1_tool_react_dapo.yaml
#	examples/agentscope_websearch/agentscopev1_websearch_agent.yaml
## Summary
- Import R3L algorithm implementation from featureA branch
- Create backup directory (R3L-back) for safety
- Fix critical syntax error in Countdown environment

## Changes

### 1. Imported from featureA
- examples/R3L/ - All R3L configuration files for 5 environments
- trinity/common/workflows/envs/R3L/ - Complete R3L workflow implementations

### 2. Backup
- trinity/common/workflows/envs/R3L-back/ - Full backup before modifications

### 3. Countdown Syntax Fix
**File**: trinity/common/workflows/envs/R3L/countdown/utils.py
- Fixed f-string syntax error in format_countdown_prompt_with_guidance()
- Error: f-string expression part cannot include a backslash
- Solution: Extract the split operation into a variable before the f-string (see the sketch below)
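
A minimal sketch of the fix pattern, with made-up strings (not the actual `format_countdown_prompt_with_guidance()` body): Python versions before 3.12 reject a backslash inside an f-string expression, so the split is hoisted out first.

```python
guidance_text = "Use larger numbers first.\nAvoid fractional intermediates."

# Before (SyntaxError on Python < 3.12):
#   prompt = f"Hint: {guidance_text.split('\n')[0]}"

# After: extract the split result before the f-string.
first_line = guidance_text.split("\n")[0]
prompt = f"Hint: {first_line}"
```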

### 4. Environment Status
- Alfworld: ✓ Complete (reference implementation)
- Countdown: ✓ Fixed syntax error
- DAPO: ✓ No changes needed (math_verify import is correct)
- ScienceWorld: ✓ No changes needed
- WebShop: ✓ No changes needed

## Testing
All Python files pass syntax validation:
✓ All R3L_workflow.py files
✓ All utils.py files

See R3L_Fix_Summary.md for detailed documentation.
## Summary
Unified all 4 environments (countdown, dapo, scienceworld, webshop) to use the same reflection schema as Alfworld (the reference implementation).

## Changes

### Reflection Prompts (prompts/reflection.j2)
All environments now use the simplified Alfworld JSON schema (an illustrative example follows the list):
- trajectory_summary: Concise overview (1-3 sentences)
- root_cause_analysis: Deep causal analysis using 'why' questioning
- trajectory_outcome: success | success_but_inefficient | failure
- improvement_suggestion: Generalizable, context-complete principle
- retry_from_step: Integer 0 to N-1 (top-level field)
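
An illustrative report matching this schema (field values invented for the example):

```json
{
  "trajectory_summary": "The agent located the mug but placed it on the wrong receptacle.",
  "root_cause_analysis": "The agent never re-checked which receptacle the task asked for before acting.",
  "trajectory_outcome": "failure",
  "improvement_suggestion": "Before placing an object, re-read the task goal and confirm the receptacle name matches it.",
  "retry_from_step": 4
}
```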

### Validation Functions (utils.py::validate_reflect_report)
Updated to validate the Alfworld schema (a rough sketch follows the list):
- Check for top-level fields: trajectory_summary, root_cause_analysis, trajectory_outcome
- Validate improvement_suggestion and retry_from_step for non-success outcomes
- Consistent validation logic across all environments
- Environment-specific log prefixes for debugging
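
A rough sketch of this shared validation logic, assuming the `(report, total_steps)` signature and `Tuple[bool, str]` return standardized later in this PR; the per-environment functions may differ in detail:

```python
from typing import Any, Dict, Tuple

VALID_OUTCOMES = ("success", "success_but_inefficient", "failure")

def validate_reflect_report(report: Dict[str, Any], total_steps: int) -> Tuple[bool, str]:
    """Sketch of the shared schema check; not the exact repository implementation."""
    prefix = "[R3L Alfworld Validation]"  # each environment uses its own prefix
    for field in ("trajectory_summary", "root_cause_analysis", "trajectory_outcome"):
        if field not in report:
            return False, f"{prefix} missing field: {field}"
    outcome = report["trajectory_outcome"]
    if outcome not in VALID_OUTCOMES:
        return False, f"{prefix} invalid trajectory_outcome: {outcome}"
    if outcome != "success":
        # Non-success outcomes must also provide a fix and a retry point.
        if "improvement_suggestion" not in report:
            return False, f"{prefix} missing improvement_suggestion"
        retry_step = report.get("retry_from_step")
        if not isinstance(retry_step, int) or not 0 <= retry_step < total_steps:
            return False, f"{prefix} retry_from_step must be an integer in [0, {total_steps - 1}]"
    return True, ""
```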

### Workflow Integration (R3L_workflow.py)
Updated retry_step extraction:
- Old: reflect_checklist["analysis"]["retry_strategy"]["retry_step"]
- New: reflect_checklist.get("retry_from_step", 0)
- Top-level field access aligns with Alfworld schema

### Environment-Specific Examples
Each reflection.j2 includes domain-appropriate examples:
- Countdown: Number equation problems
- DAPO: Mathematical problem solving
- ScienceWorld: Interactive science experiments
- WebShop: E-commerce navigation

## Testing
All environments pass Python syntax validation:
✓ alfworld (reference)
✓ countdown
✓ dapo
✓ scienceworld
✓ webshop

## Impact
- Consistent reflection format across all R3L environments
- Simplified schema reduces complexity
- Easier to maintain and extend
- Aligns with the carefully tuned Alfworld implementation
…unction

## Summary
Completed the R3L environment unification by standardizing self_correction prompts and the reflect_report_to_guidance_prompt function across all environments to match Alfworld.

## Changes

### 1. Self-Correction Prompts (self_correction.j2)
Unified all 4 environments to use Alfworld's concise template:

**Before** (3 different versions):
- Countdown/DAPO: Detailed instructional format with bullet points
- ScienceWorld/WebShop: "Internal Monologue Directive" format
- Alfworld: Simple, direct format

**After** (All environments now use):
```
Your previous attempt encountered issues. Below is a reflection based on user and environment feedback:

{{ report }}

Apply the lessons learned from this reflection to avoid repeating the same mistakes. Do not mention or reference this guidance in your response.
```

### 2. DAPO reflect_report_to_guidance_prompt Function
Updated DAPO's complex implementation to match the simple template-based approach used by other environments:

**Before** (DAPO only):
- Manually constructed guidance from nested schema fields
- Tried to extract from analysis.flaw_analysis.lessons_learned
- 53 lines of complex string building logic

**After** (All environments):
```python
def reflect_report_to_guidance_prompt(report: Dict[str, Any]) -> str:
    report_str = json.dumps(report, indent=2, ensure_ascii=False)
    jinja_env = _get_jinja_env()
    template = jinja_env.get_template("self_correction.j2")
    return template.render(report=report_str)
```

### 3. System Prompts (xxx_system.j2)
**No changes needed** - These are environment-specific:
- Interactive environments (alfworld, scienceworld, webshop): Use <think> + <action>
- Single-step environments (countdown, dapo): Use <think> + <answer>
This difference is intentional and correct.

## Testing
All environments pass Python syntax validation:
✓ alfworld
✓ countdown
✓ dapo
✓ scienceworld
✓ webshop

## Impact
- Complete consistency across all R3L prompts and functions
- Simpler, more maintainable codebase
- All environments now fully aligned with Alfworld reference implementation
## Summary
Fixed 3 critical issues in R3L workflows to ensure correct training behavior and consistency across all environments.

## 🔴 Critical Fixes

### 1. Fix ScienceWorld missing max_reflect_tokens (RUNTIME ERROR)
**File**: trinity/common/workflows/envs/R3L/scienceworld/R3L_workflow.py:41
- **Problem**: Used `max_reflect_tokens` at line 122 but never defined it
- **Fix**: Added `self.max_reflect_tokens = 4096` in __init__
- **Impact**: Prevents AttributeError at runtime

### 2. Fix Alfworld unsafe retry_step extraction (POTENTIAL KeyError)
**File**: trinity/common/workflows/envs/R3L/alfworld/R3L_workflow.py:246
- **Problem**: Used `reflect_checklist["retry_from_step"]` (unsafe dictionary access)
- **Fix**: Changed to `reflect_checklist.get("retry_from_step", 0)`
- **Impact**: Prevents KeyError if field is missing

### 3. Add _adjust_action_mask_for_retry to Countdown and DAPO (TRAINING DATA QUALITY)
**Files**:
- trinity/common/workflows/envs/R3L/countdown/R3L_workflow.py:153-192, 320-327
- trinity/common/workflows/envs/R3L/dapo/R3L_workflow.py:172-211, 339-346

- **Problem**: Missing critical method that ensures retry prefixes are excluded from training
- **Fix**: Added method definition and calls in run() method
- **Impact**: Ensures only improved retry attempts are trained, not failed first attempts
- **Details** (a simplified sketch follows this list):
  - Identifies assistant response segments in action_mask
  - Marks first retry_step segments as non-trainable (mask=0)
  - Applied to both second_exp and corresponding first_exp for consistency
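
A simplified sketch of the masking idea, assuming `action_mask` is a 1-D 0/1 tensor in which each contiguous run of 1s corresponds to one assistant turn (the real method also has to navigate the experience object layout):

```python
import torch

def adjust_action_mask_for_retry(action_mask: torch.Tensor, retry_step: int) -> torch.Tensor:
    """Zero out the first `retry_step` assistant segments so they are excluded from training."""
    mask = action_mask.clone()
    segments, start = [], None
    for i, value in enumerate(mask.tolist()):
        if value == 1 and start is None:
            start = i
        elif value == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    for seg_start, seg_end in segments[:retry_step]:
        mask[seg_start:seg_end] = 0  # prefix reused from the first attempt: non-trainable
    return mask
```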

## Testing
All environments pass Python syntax validation:
✓ alfworld/R3L_workflow.py
✓ countdown/R3L_workflow.py
✓ dapo/R3L_workflow.py
✓ scienceworld/R3L_workflow.py
✓ webshop/R3L_workflow.py

## Impact
- ScienceWorld: Fixed runtime crash
- Alfworld: Improved robustness
- Countdown/DAPO: Correct training mask behavior (critical for SFT quality)

## Configuration Notes
- max_tokens: DAPO and Countdown use 4096 (math problems need longer reasoning); other envs use 512
- max_reflect_tokens: All environments use 4096 (consistent reflection capacity)
Align webshop and scienceworld with alfworld's approach of reminding
the model about the required format (<think>...</think> <action>...</action>)
in every user prompt. This improves format compliance during multi-turn
interactions.

Changes:
- webshop/utils.py: Enhanced format_observation to include format reminder
- scienceworld/utils.py: Enhanced format_observation to include format reminder (sketch below)
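
A minimal sketch of the shared pattern, assuming hypothetical wording for the reminder text (the actual `format_observation` signatures in webshop/utils.py and scienceworld/utils.py may differ):

```python
# Hypothetical reminder wording; the real prompts define their own text.
FORMAT_REMINDER = (
    "Respond with your reasoning inside <think>...</think> "
    "followed by exactly one <action>...</action>."
)

def format_observation(observation: str) -> str:
    """Append the per-turn format reminder to every user-facing observation."""
    return f"{observation}\n\n{FORMAT_REMINDER}"
```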
…ronments

This commit addresses several critical bugs and inconsistencies found during
rigorous comparison of all R3L environments against the alfworld reference.

Critical Bug Fixes (Priority 1):
1. alfworld/utils.py: Fixed duplicate variable check in validate_reflect_report
   - Line 565 was checking retry_from_step twice instead of checking both
     improvement_suggestion and retry_from_step
2. alfworld/utils.py: Added missing else clause for invalid trajectory_outcome
   - Previously would not handle invalid outcome values explicitly
3. webshop/R3L_workflow.py: Removed hardcoded absolute path
   - Replaced with environment variable WEBSHOP_PATH for portability

Standardization Fixes (Priority 2):
1. alfworld/utils.py: Standardized validate_reflect_report function signature
   - Changed parameter from max_steps (optional) to total_steps (required)
   - Changed return type hint from tuple to Tuple for consistency
   - Added Tuple to imports
2. alfworld/utils.py: Added [R3L Alfworld Validation] prefix to all validation
   error messages for consistency with other environments
3. alfworld/prompts/self_correction.j2: Added trailing newline for consistency
4. alfworld/prompts/reflection.j2: Added environment-specific hint
   "(ALFWorld interactive tasks)" in improvement_suggestion field description
5. countdown/utils.py: Fixed reflect_report_to_guidance_prompt code order
   - Now converts report before loading template (consistent with others)
   - Updated docstring to full detailed version
6. scienceworld/R3L_workflow.py: Added Args section to _adjust_action_mask_for_retry
   docstring for consistency

All environments now follow consistent patterns and conventions.
Keep the original hardcoded path as default, but allow override via
WEBSHOP_PATH environment variable for flexibility in different environments.
WebShop Hardcoded Path Fixes (fallback pattern sketched below):
- grpo_workflow.py: Add environment variable WEBSHOP_PATH support with fallback
- opmd_workflow.py: Add environment variable WEBSHOP_PATH support with fallback
- raft_workflow.py: Add environment variable WEBSHOP_PATH support with fallback
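
A sketch of the fallback pattern, with a placeholder default (the real fallback is the previously hardcoded path):

```python
import os

# Prefer WEBSHOP_PATH when set; otherwise fall back to the original hardcoded location.
DEFAULT_WEBSHOP_PATH = "/path/to/webshop"  # placeholder for the original default
webshop_path = os.environ.get("WEBSHOP_PATH", DEFAULT_WEBSHOP_PATH)
```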

DAPO Format Reminder Enhancement:
- dapo/utils.py: Add format_dapo_prompt() function for per-turn format reminders
- Update first_rollout() to use format_dapo_prompt for initial prompt
- Update first_rollout() feedback messages to include format reminders
- Update second_rollout() to use format_dapo_prompt for initial prompt
- Update second_rollout() feedback messages to include format reminders

Now DAPO aligns with other environments (alfworld, countdown, scienceworld, webshop)
by reminding the model about required format (<think>...</think> <answer>...</answer>)
in every user turn, not just the system prompt.
RAFT Error Handling (align to alfworld pattern):
- countdown/raft_workflow.py: Fix to only append successful samples (reward >= 1.0)
  - Success: append exp
  - Failure: append default_exp
  - Exception: append default_exp
- dapo/raft_workflow.py: Same fix as countdown
- scienceworld/raft_workflow.py: Same fix as countdown
- webshop/raft_workflow.py: Same fix as countdown

Previous behavior was appending all exps regardless of success/failure, and
ignoring exceptions entirely. New behavior matches alfworld's RAFT pattern.
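
A condensed sketch of the per-task loop described above; `run_rollout` and `make_default_exp` stand in for the workflows' actual rollout and default-experience construction:

```python
def collect_raft_experiences(tasks, run_rollout, make_default_exp):
    """Keep only successful rollouts; fall back to a default experience otherwise."""
    experiences = []
    for task in tasks:
        try:
            exp = run_rollout(task)
            if exp.reward >= 1.0:
                experiences.append(exp)                      # success: keep the real sample
            else:
                experiences.append(make_default_exp(task))   # failure: placeholder
        except Exception:
            experiences.append(make_default_exp(task))       # exception: placeholder
    return experiences
```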

WebShop Fixes:
- webshop/grpo_workflow.py: Remove unnecessary max_reflect_tokens=4096
  (GRPO doesn't use reflection, only R3L does)
- webshop/raft_workflow.py: Fix registration name from "RAFT_baseline" to
  "raft_baseline" for consistency with other environments

All environments now follow consistent RAFT training logic.
DAPO Workflow Creation:
- countdown/dapo_workflow.py: Created DAPO workflow for countdown environment
  - Uses max_attempts=3, max_tokens=4096
  - Computes overlong penalty on response length
  - Returns accuracy + format_score as total reward

- dapo/dapo_workflow.py: Created DAPO workflow for dapo environment
  - Uses max_attempts=3, max_tokens=4096
  - Handles 3 different prompt formats (prompt/question/problem)
  - Same DAPO penalty computation as other environments

- scienceworld/dapo_workflow.py: Created DAPO workflow for scienceworld
  - Uses max_env_steps=30, max_tokens=16384
  - Interactive environment with step-by-step execution
  - Computes overlong penalty on response tokens

- webshop/dapo_workflow.py: Created DAPO workflow for webshop
  - Uses max_env_steps=15, max_tokens=512
  - Includes WebShop environment initialization with path config
  - Compatible with existing webshop utils.first_rollout()

All 4 DAPO workflows follow the same pattern as alfworld/dapo_workflow.py (a penalty sketch follows this list):
- DAPO-style overlong penalty based on response token length
- Configurable penalty parameters (penalty_factor, max_response_length, cache_length)
- Total reward = accuracy_score + format_score
- Clean metrics tracking (accuracy, format_score, response_length, total_reward)
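
A sketch of a DAPO-style soft overlong penalty using the configurable parameters named above; the exact scaling in the repository's dapo_workflow.py files may differ, and how the penalty is combined with accuracy_score and format_score is up to each workflow:

```python
def overlong_penalty(response_length: int,
                     max_response_length: int = 4096,
                     cache_length: int = 512,
                     penalty_factor: float = 1.0) -> float:
    """Non-positive penalty that ramps up as the response approaches or exceeds the limit."""
    expected_length = max_response_length - cache_length
    if response_length <= expected_length:
        return 0.0
    overflow = response_length - expected_length
    # Linear ramp inside the cache window, capped at -penalty_factor beyond the hard limit.
    return -penalty_factor * min(overflow / cache_length, 1.0)
```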

WebShop RAFT Fix:
- webshop/raft_workflow.py: Reverted registration name from "raft_baseline"
  back to "RAFT_baseline" to avoid breaking config file dependencies

Now all 5 environments have complete baseline implementations:
R3L, GRPO, OPMD, RAFT, and DAPO workflows.
- Unified all RAFT workflow registration names to use uppercase 'RAFT_baseline_'
- Updated workflow registration in trinity/common/workflows/envs/R3L for:
  * alfworld/raft_workflow.py
  * countdown/raft_workflow.py
  * dapo/raft_workflow.py
  * scienceworld/raft_workflow.py
- Updated corresponding config files in examples/R3L to match:
  * All RAFT_7B.yaml and RAFT_1.5B.yaml files now use 'RAFT_baseline_xxx_workflow'
- This ensures consistency between function registration names, __init__.py imports,
  and configuration file workflow_type declarations
@shiweijiezero shiweijiezero deleted the claude/fix-raft-function-registration-01MAQ5cT7dvcGWgH3eNShzTu branch November 14, 2025 19:16
@gemini-code-assist
Contributor

Summary of Changes

Hello @shiweijiezero, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's capabilities by integrating advanced R3L (Reinforcement Learning from Reflection and Reasoning) workflows for diverse environments. It introduces new configurations and utility scripts to support these sophisticated learning paradigms, aiming to boost model efficiency and effectiveness through iterative self-correction. Concurrently, it refines existing system parameters by enabling key performance optimizations and adjusting synchronization settings for improved robustness.

Highlights

  • New R3L Workflows and Configurations: Introduced a comprehensive set of R3L (Reinforcement Learning from Reflection and Reasoning) workflows across multiple environments including Alfworld, Countdown, DAPO, ScienceWorld, and Webshop. This includes new YAML configuration files for various model sizes (1.5B, 7B) and algorithms (RAFT, GRPO, OPMD, OPMD_Reweight_Adv), along with Python scripts for dataset generation.
  • Performance Optimizations and Stability Improvements: Enabled prefix caching and chunked prefill in numerous existing YAML configuration files to enhance performance. Additionally, increased the sync_timeout values in several configurations to improve stability during synchronization processes.
  • New Algorithm and Buffer Operators: Added a new advantage function, OPMDReweightAdvGroupAdvantage, and registered it as a new algorithm type opmd_reweight_adv. Also introduced RAFTFilter and OPMDFilter for experience buffer operations, providing more granular control over data processing.
  • Enhanced Progress Bar Visibility: Set enable_progress_bar to True by default across StorageConfig, TasksetConfig, and ExperienceBufferConfig to provide better visibility into data loading and processing progress.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant number of new configuration files and workflows for different environments (ALFWorld, Countdown, DAPO, ScienceWorld, WebShop) under the R3L directory. It also includes several new Python scripts for data generation and new algorithm implementations like OPMDReweightAdvGroupAdvantage. Additionally, many existing configuration files have been updated to enable performance features like prefix caching and to increase timeouts.

My review focuses on the new code and configurations. I've identified several issues:

  • Hardcoded paths: Some data generation scripts contain hardcoded local paths, which is a critical issue for portability.
  • Fragile JSON parsing: Several new workflow files use a fragile method to parse JSON from model outputs, which could easily break.
  • Code Quality: There are some instances of redundant code and hardcoded values that should be replaced with configurable parameters.
  • Typo: A minor typo was found in a data generation script.

Overall, this is a substantial contribution. Addressing the identified issues will improve the robustness and maintainability of the new code.


# FIX 1: Change the default values to None
def create_dataset_files(output_dir, train_size=None, test_size=None):
    alfworld_data_root = "/export/project/shiweijie/weijie/trinity/alfworld/json_2.1.1"

critical

The path to alfworld_data_root is hardcoded. This makes the script not portable and will fail on other machines. It's better to pass this path as a command-line argument or read it from an environment variable.


if __name__ == "__main__":
    # NOTE: Manually set the jar path here.
    jar_path = "/your/path/ScienceWorld/scienceworld/scienceworld.jar"

critical

The path to the ScienceWorld JAR file is hardcoded with a placeholder value. This will cause a FileNotFoundError when the script is run. This path should be made configurable, for example, by reading it from an environment variable or a command-line argument.

"jar_path": jar_path,
}
task_desc = json.dumps(task_config)
train_data.append({"task_desc": task_desc, "targe": ""})

medium

There is a typo in the dictionary key. 'targe' should probably be 'target'.

Suggested change
- train_data.append({"task_desc": task_desc, "targe": ""})
+ train_data.append({"task_desc": task_desc, "target": ""})

"jar_path": jar_path,
}
task_desc = json.dumps(task_config)
test_data.append({"task_desc": task_desc, "targe": ""})

medium

There is a typo in the dictionary key. 'targe' should probably be 'target'.

Suggested change
- test_data.append({"task_desc": task_desc, "targe": ""})
+ test_data.append({"task_desc": task_desc, "target": ""})

reward_mean = torch.mean(group_rewards)
if len(exps) == 1:
    group_baseline = torch.tensor(0.0)
    group_rewards = torch.tensor([exps[0].reward], dtype=torch.float32)

medium

The variable group_rewards is redefined here. It was already defined on line 122. This redefinition is redundant and can be removed.

Comment on lines +129 to +136
first_brace = reflection_text.find('{')
last_brace = reflection_text.rfind('}')

if first_brace != -1 and last_brace != -1 and first_brace < last_brace:
    json_str = reflection_text[first_brace:last_brace + 1]
else:
    json_str = reflection_text


medium

This method of finding the first and last brace to extract a JSON string is fragile. It can fail if the model's response includes other text with braces outside the main JSON object. A more robust approach is to use a regular expression to find a JSON block, especially one enclosed in markdown code fences (e.g., a fenced json code block). For example: re.search(r'```json\s*(\{.*?\})\s*```', reflection_text, re.DOTALL).
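
A sketch of the more robust parsing this comment suggests, preferring a fenced JSON block and falling back to brace matching (illustrative only; the fence pattern is assembled from parts so the snippet can sit inside a fenced code block):

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled so this snippet renders cleanly
JSON_BLOCK = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)

def extract_reflection_json(reflection_text: str) -> dict:
    """Prefer a fenced JSON block; otherwise fall back to outermost-brace extraction."""
    fenced = JSON_BLOCK.search(reflection_text)
    if fenced:
        return json.loads(fenced.group(1))
    first_brace = reflection_text.find("{")
    last_brace = reflection_text.rfind("}")
    if first_brace != -1 and last_brace > first_brace:
        return json.loads(reflection_text[first_brace:last_brace + 1])
    return json.loads(reflection_text)
```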
